
Conversation

bisgaard-itis
Contributor

@bisgaard-itis bisgaard-itis commented Jun 30, 2025

What do these changes do?

  • Joel (from the desktop team) experienced that the api-server would sometimes hang when uploading files concurrently (see issue below).
  • It was already established that this was due to the cancel_on_disconnect decorator.
  • Together with @giancarloromeo we discovered that the issue seems to be that when task.cancel() is called (on an asyncio.Task) there is no guarantee that the task will be cancelled the next time the event loop picks it up. This is an issue in our construction because, apparently, FastAPI/Starlette doesn't return the response to the user until the poller task is done. The fix introduced here is to use an asyncio.TaskGroup. After syncing with Gemini and @giancarloromeo this seems to be the modern, idiomatic way of getting "structured concurrency" in Python (a minimal sketch of the idea is shown below). I have tested this by uploading 500 files concurrently to my local deployment.
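
Roughly, the shape of the fix is the following (a minimal sketch of the idea, not the exact code in this PR; it assumes Python 3.11+ for asyncio.TaskGroup and that the wrapped endpoint receives the Request as a keyword argument):

```python
import asyncio
from functools import wraps

from fastapi import Request


def cancel_on_disconnect(handler):
    # Minimal sketch only: run the handler and a disconnect poller inside an
    # asyncio.TaskGroup so that neither task can outlive the request.
    @wraps(handler)
    async def wrapper(*args, request: Request, **kwargs):
        async with asyncio.TaskGroup() as tg:
            handler_task = tg.create_task(handler(*args, request=request, **kwargs))

            async def poll_disconnect() -> None:
                while not await request.is_disconnected():
                    await asyncio.sleep(1)
                # Client went away: stop the handler
                handler_task.cancel()

            poller_task = tg.create_task(poll_disconnect())
            try:
                return await handler_task
            finally:
                # Handler is done (or was cancelled): stop polling. Leaving the
                # `async with` block waits until *both* tasks have actually
                # finished, so no orphaned poller can keep using `request`.
                poller_task.cancel()

    return wrapper
```

The structural point is in the last comment: the wrapper cannot return while a poller task is still alive and holding a reference to the request, which is exactly the situation that caused the hang.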

Related issue/s

How to test

Dev-ops

Food for thought

  • What if the poller task finishes first? In that case the handler task will be cancelled. But apparently there is no guarantee that the task will be cancelled the next time the event loop picks it up, so the handler task can live on for a while (see the small illustration below). So the question is: is it even worth having the added complexity of the cancel_on_disconnect decorator?
  • Note also: a more modern way to do this would be using an asyncio.TaskGroup
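
A small, self-contained illustration of that point (standard library only): task.cancel() merely requests cancellation, and the CancelledError is only delivered at the task's next suspension point, so the task is not necessarily gone by the time cancel() returns.

```python
import asyncio

async def worker():
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        print("worker: cancellation finally delivered")
        raise

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)                   # let the worker start
    task.cancel()                            # only *requests* cancellation
    print(f"right after cancel(): done={task.done()}")          # -> done=False
    try:
        await task                           # the loop now delivers the cancellation
    except asyncio.CancelledError:
        print(f"after awaiting: cancelled={task.cancelled()}")  # -> cancelled=True

asyncio.run(main())
```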

After discussing this issue with Gemini, here's the main explanation for the deadlock encountered here:

  Here is the step-by-step flow under this corrected understanding:

   1. Handler Finishes: Your handler_task completes.
   2. Cleanup Begins: The wrapper starts cleaning up the poller_task.
   3. Timeout Occurs: await asyncio.wait_for(poller_task, timeout=3) raises asyncio.TimeoutError because the poller is still stuck in await request.is_disconnected().
   4. Wrapper Finishes: The except block catches the timeout, the finally block runs, and the wrapper function successfully executes return await handler_task. From the perspective of your code, the request handler has returned a value and is done.
   5. The Orphaned Task: But the poller_task was not actually killed. The wrapper simply stopped waiting for it. The poller_task is now an "orphan": still alive and running on the event loop, completely detached from its parent.
   6. Server Cleanup Deadlock: The ASGI server (Starlette/Uvicorn) has received the response object from your wrapper. It sends the response back to the client. Now it needs to perform its own internal cleanup for the request. This cleanup involves releasing resources associated with the request object.

  This is where the new deadlock happens.

   * The server's cleanup logic is waiting to fully release the request object and its resources.
   * The orphaned poller_task is still alive and holds a reference to that same request object, periodically calling await request.is_disconnected() on it.

  The server cannot fully clean up and close the connection context while your orphaned task is still actively using the request object. The server is essentially waiting for the poller_task to finish, but the poller_task will only finish when the client disconnects, which may never happen, as the client has already received the response and is likely waiting for the connection to close cleanly.
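
To make the steps above concrete, the problematic pattern looks roughly like this (a simplified sketch for illustration, not the verbatim decorator from the codebase):

```python
import asyncio
from functools import wraps

from fastapi import Request


def cancel_on_disconnect_old(handler):
    # Simplified sketch of the *previous* pattern described above; not verbatim code.
    @wraps(handler)
    async def wrapper(*args, request: Request, **kwargs):
        async def poll_disconnect() -> None:
            while not await request.is_disconnected():
                await asyncio.sleep(1)

        poller_task = asyncio.create_task(poll_disconnect())
        handler_task = asyncio.create_task(handler(*args, request=request, **kwargs))
        try:
            return await handler_task  # step 4: the wrapper returns the handler's result
        finally:
            poller_task.cancel()
            try:
                # step 3: if the poller does not react to the cancellation in time,
                # this raises and the wrapper gives up on it ...
                await asyncio.wait_for(poller_task, timeout=3)
            except (asyncio.TimeoutError, asyncio.CancelledError):
                # step 5: ... leaving poller_task orphaned, still alive and still
                # holding a reference to `request`, which is what blocks the
                # server's own cleanup of the request (step 6).
                pass

    return wrapper
```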

@bisgaard-itis bisgaard-itis requested a review from pcrespov as a code owner June 30, 2025 14:44
@bisgaard-itis bisgaard-itis self-assigned this Jun 30, 2025
@bisgaard-itis bisgaard-itis added the bug (buggy, it does not work as expected) and a:apiserver (api-server service) labels Jun 30, 2025
@bisgaard-itis bisgaard-itis added this to the Engage milestone Jun 30, 2025
@giancarloromeo
Contributor

Now we deterministically exit from the poller loop. 👌

@codecov

codecov bot commented Jun 30, 2025

Codecov Report

❌ Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.92%. Comparing base (c5371bd) to head (b8e20d2).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8017      +/-   ##
==========================================
- Coverage   88.04%   85.92%   -2.12%     
==========================================
  Files        1907     1372     -535     
  Lines       73282    56786   -16496     
  Branches     1302      650     -652     
==========================================
- Hits        64519    48795   -15724     
+ Misses       8374     7762     -612     
+ Partials      389      229     -160     
Flag Coverage Δ
integrationtests 64.31% <ø> (-0.02%) ⬇️
unittests 86.00% <93.47%> (-0.68%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 71.80% <93.47%> (+0.10%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.05% <ø> (-0.12%) ⬇️
agent 93.81% <ø> (ø)
api_server ∅ <ø> (∅)
autoscaling 95.89% <ø> (ø)
catalog 92.34% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 91.81% <ø> (-0.57%) ⬇️
datcore_adapter 97.94% <ø> (ø)
director 76.14% <ø> (ø)
director_v2 77.95% <ø> (-12.96%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.12% <ø> (ø)
efs_guardian 89.60% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.60% <ø> (ø)
resource_usage_tracker 92.13% <ø> (-0.48%) ⬇️
storage 86.42% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.19% <ø> (+0.05%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c5371bd...b8e20d2.


Contributor

@giancarloromeo giancarloromeo left a comment


What a hot afternoon ☀️

Member

@pcrespov pcrespov left a comment


Please add a test in the service-library that reproduces the scenario that you are trying to cover.

@bisgaard-itis
Contributor Author

bisgaard-itis commented Jul 1, 2025

After a very useful conversation with my new best friend Gemini, here's the best understanding I have of how to handle asyncio tasks. I will "publish" it here because I think it is very insightful:

The Critical Difference: Where the Program Hangs

So what is the "guarantee" that TaskGroup provides?

The guarantee is not that the task will be killed instantly. The guarantee is structural: the TaskGroup guarantees that your program will not proceed past the async with block and leave an orphaned task behind.

Let's compare the two scenarios:

Scenario 1: Manual Cancellation (The "No Guarantee" case)

```python
async def my_parent_task():
    task = asyncio.create_task(un_cancellable_task())
    task.cancel()
    # My parent task MOVES ON, thinking it cancelled the child.
    # The child is now an ORPHAN, running forever in the background,
    # potentially holding resources and causing silent bugs.
    print("Parent task finished, but the problem is hidden.")
```

Here, the guarantee fails because your parent task has a completely incorrect view of the system's state.

Scenario 2: Using TaskGroup

```python
async def my_parent_task():
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(un_cancellable_task())
            # Let's say another task in the group fails, triggering the
            # cleanup and cancellation of un_cancellable_task().
    except Exception:
        pass

    # THIS LINE IS NEVER REACHED.
    # The `async with` block itself hangs forever, waiting for the
    # un_cancellable_task to finish.
    print("Parent task finished.")
```

Here, the TaskGroup surfaces the problem. It doesn't hide it. The program hangs at the exact point where the misbehaving task is preventing cleanup. You now have a clear place to debug. The guarantee is that you cannot have "zombie" or "orphan" tasks that you don't know about. The TaskGroup forces you to confront them.

The Real Solution: Combining TaskGroup with Timeouts

So, how do you get the best of both worlds? A guarantee of no orphans, and a guarantee that your parent task won't hang forever?

You wrap the entire TaskGroup in a timeout. This is the ultimate pattern for robust concurrent code.

```python
import asyncio

async def un_cancellable_task():
    print("Un-cancellable task started and will ignore cancellation.")
    # This task has no await, so it can't be cancelled.
    for i in range(10_000_000_000):
        pass
    print("This will never be printed.")

async def main():
    print("Starting main task.")
    try:
        # Python 3.11+ has the cleaner `asyncio.timeout()`
        async with asyncio.timeout(2):
            async with asyncio.TaskGroup() as tg:
                tg.create_task(un_cancellable_task())
                # The TaskGroup will try to cancel the task when the
                # outer timeout is triggered.

    except asyncio.TimeoutError:
        print("The entire operation timed out after 2 seconds.")
        print("The un-cancellable task is now orphaned, but we KNOW it.")
        print("Our main program flow is safe and can continue.")

    print("Main task has finished.")
```

This is the pinnacle of asyncio safety:

  1. The TaskGroup ensures all tasks within the operation are managed as a single unit.
  2. The asyncio.timeout() ensures that your application's control flow will always proceed, even if one of the tasks inside the group misbehaves and refuses to cancel.

@giancarloromeo
Contributor

Thanks for the further investigation; it's similar to the solution adopted for Celery task monitoring. 👌

Member

@sanderegg sanderegg left a comment


Please consider using RequestCancellationMiddleware, which is tested. Also check whether that one has the problem you discovered.
Ideally I would also prefer that we re-use that code rather than have two different implementations.
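
(For reference, the general shape of a disconnection-cancelling ASGI middleware is roughly the following. This is a hypothetical sketch to illustrate the pattern, not the actual RequestCancellationMiddleware from the service library.)

```python
import asyncio
import contextlib

from starlette.types import ASGIApp, Receive, Scope, Send


class CancelOnDisconnectMiddleware:
    # Hypothetical sketch of the general pattern only; NOT the service
    # library's RequestCancellationMiddleware.
    def __init__(self, app: ASGIApp) -> None:
        self.app = app

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        queue: asyncio.Queue = asyncio.Queue()

        async def relay_receive() -> None:
            # Forward ASGI messages to the app and return once the client disconnects
            while True:
                message = await receive()
                await queue.put(message)
                if message["type"] == "http.disconnect":
                    return

        app_task = asyncio.create_task(self.app(scope, queue.get, send))
        relay_task = asyncio.create_task(relay_receive())
        try:
            done, _ = await asyncio.wait(
                {app_task, relay_task}, return_when=asyncio.FIRST_COMPLETED
            )
            if relay_task in done and not app_task.done():
                # Client disconnected before the handler finished: cancel it
                app_task.cancel()
        finally:
            # Never leave either task behind as an orphan
            for task in (relay_task, app_task):
                task.cancel()
                with contextlib.suppress(asyncio.CancelledError):
                    await task
```

Whether this lives in a middleware or a decorator, the underlying cancellation semantics (and the orphaned-task pitfall discussed above) are the same, which is the argument for sharing a single implementation.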

@bisgaard-itis bisgaard-itis requested a review from pcrespov July 1, 2025 11:02
@bisgaard-itis bisgaard-itis requested a review from sanderegg July 2, 2025 07:03
@bisgaard-itis
Contributor Author

> I would still like to have one implementation for this and not two different ones. Also, you are aware that @cancel_on_disconnect will also cancel any Starlette BackgroundTask in the request, right?

OK, I will refactor so that both variants share a single implementation, but I will keep both entry points: a decorator and a middleware.

Yes, I am aware that @cancel_on_disconnect will cancel a Starlette BackgroundTask. We have one endpoint in the api-server where we use those (the log streaming) and we don't use this decorator there. So that should be fine.

Member

@sanderegg sanderegg left a comment


I am a bit confused now, as we went through these yesterday with @pcrespov and this again looks different. Can you maybe sync? Then I am happy to approve. Thanks, and sorry for the mess.

@pcrespov
Member

pcrespov commented Jul 5, 2025

@bisgaard-itis I had some interesting experiences with these cancellations in my PR #8014, where I also added uvloop in testing (up to now we were using uvloop in production and the asyncio standard loop for testing). We can talk about it on Monday.

Member

@pcrespov pcrespov left a comment


Please reassign back to me once the requested changes have been checked. Thx :-)

@giancarloromeo giancarloromeo removed their assignment Jul 7, 2025
@sonarqubecloud

sonarqubecloud bot commented Aug 7, 2025


Labels

🤖-automerge (marks PR as ready to be merged for Mergify), a:apiserver (api-server service), bug (buggy, it does not work as expected)


Development

Successfully merging this pull request may close these issues.

Ensure POST /files/content doesn't hang in api-server

5 participants